The goal of this analysis is to examine proteomic data from HeLa cells obtained via fractionation experiments with and without RNase treatment. The workflow includes data quality control, normalization, and visualization of signal distribution across fractions. Changes in protein distribution upon RNase treatment are assessed to draw conclusions about the biophysical and functional characteristics of RNA-dependent proteins.
MS_Table = read.table("Datensätze/RDeeP_HeLa_Interphase.csv", header=TRUE, row.names=1, sep = ";")
# for Plotting & Visualization:
library(ggplot2) # For creating versatile, layered graphics and plots.
library(pheatmap) # For heatmaps
library(RColorBrewer) # Provides nice color palettes that can be used in plots for better aesthetics and color consistency.
library(grDevices) # Contains functions for drawing polygons and other graphic primitives
library(DiagrammeR) # For creating diagrams and flowcharts programmatically using graph syntax.
library(DiagrammeRsvg) # Works with DiagrammeR to export diagrams as SVG.
library(rsvg) # Converts SVG images (such as from DiagrammeRsvg) into other formats (e.g., PNG).
# for Data Manipulation & Tables
library(dplyr) # For data wrangling — filtering, selecting, grouping, summarizing data frames with readable syntax.
library(tidyr) # For reshaping and tidying data, e.g., converting between wide and long formats.
library(readr) # For fast and friendly reading of flat files (CSV, TSV).
# for Statistics & Clustering & PCA
library(cluster) # For cluster analysis — calculating silhouette widths
library(factoextra) # Provides functions to extract and visualize results of multivariate data analyses like PCA, clustering, and helps create elbow and silhouette plots.
library(pracma) # Provides practical numerical math functions — here used for area calculation with trapezoidal rule
library(effects) # For calculating and plotting effects of predictors in regression models (helps interpret model results)
A consistent and visually appealing color palette is used throughout the code to distinguish data categories clearly.
# Hexcodes for the powerpoint-paltte, using for the poster
"#A786C2"; "#8684C2"; "#617297"; "#7396AD"; "#8CABB5"; "#728083"
"#EDE6F2"; "#E7E6F2"; "#DFE2EA"; "#E2EAEE"; "#E7EEF0"; "#E2E5E6"
"#DBCEE6"; "#CFCDE6"; "#BFC6D5"; "#C6D4DE"; "#D0DDE0"; "#C5CCCD"
"#C9B6D9"; "#B7B5DA"; "#9FAAC1"; "#AABFCC"; "#BACCD2"; "#A9B3B5"
"#7E51A4"; "#504DA4"; "#485571"; "#527286"; "#628691"; "#556062"
"#54366D"; "#35336D"; "#30394B"; "#364B5A"; "#415961"; "#394041"
# Color palette for treatments
treatment_colors = c("Control" = "#8CABB5", "RNase" = "#A786C2")
# Color palette for shifts
shift_colors = c(
"strong_left_shift" = "#7E51A4", "moderate_left_shift" = "#C9B6D9","no_shift" = "#D0DDE0","moderate_right_shift" = "#9FAAC1","strong_right_shift" = "#485571")
# Color for gradient1
gradient_colour1 = c("#C9B6D9", "#A786C2", "#7E51A4", "#504DA4", "#30394B")
# Color for gradient2
gradient_colour2 = c("#C6D4DE", "#7396AD", "#527286", "#485571", "#394041")
# Color for RBPs & non_RBPs
color_rbp = c(RBP = "#504DA4",NO_RBP = "#7396AD")
# Color for Clustering
cluster_colors = c("#7E51A4", "#504DA4", "#485571","#7396AD", "#DBCEE6", "#A9B3B5")
# Poster-Theme
theme_poster = theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
axis.title = element_text(face = "bold"),
axis.text.x = element_text(size = 9),
legend.title = element_text(face = "bold"),
legend.position = "right"
)
The goal of the data cleanup step is to prepare the raw mass spectrometry data for reliable downstream analysis by addressing technical variability and ensuring the data’s overall quality and consistency. Mass spectrometry measurements often contain noise and systematic differences due to sample handling, instrument fluctuations, or experimental conditions. These technical variations can obscure true biological signals and lead to incorrect conclusions if not properly controlled.
MS_Table = MS_Table[setdiff(rownames(MS_Table), 'FHOD3_HUMAN'),]
The goal of this normalization workflow is to reduce technical variation across replicates and enable meaningful comparison of protein distributions across fractions and conditions (e.g., control vs. RNase-treated). By aligning replicate intensities and scaling each profile to a common reference, we ensure that biological patterns are not obscured by measurement noise or loading differences.
The diagram below summarizes the key steps of the data cleaning and normalization process:
This workflow ensures consistent, comparable profiles across all proteins and conditions.
# Setup consistent names for fractions, mean tables, replicates, and treatments. These predefined names streamline access and automate processing throughout the analysis workflow.
Fraktionsnamen = c('Fraction1_Ctrl', 'Fraction1_RNAse', 'Fraction2_Ctrl', 'Fraction2_RNAse','Fraction3_Ctrl', 'Fraction3_RNAse','Fraction4_Ctrl', 'Fraction4_RNAse','Fraction5_Ctrl', 'Fraction5_RNAse','Fraction6_Ctrl', 'Fraction6_RNAse','Fraction7_Ctrl', 'Fraction7_RNAse','Fraction8_Ctrl', 'Fraction8_RNAse','Fraction9_Ctrl', 'Fraction9_RNAse','Fraction10_Ctrl', 'Fraction10_RNAse','Fraction11_Ctrl', 'Fraction11_RNAse','Fraction12_Ctrl', 'Fraction12_RNAse','Fraction13_Ctrl', 'Fraction13_RNAse','Fraction14_Ctrl', 'Fraction14_RNAse','Fraction15_Ctrl', 'Fraction15_RNAse','Fraction16_Ctrl', 'Fraction16_RNAse','Fraction17_Ctrl', 'Fraction17_RNAse','Fraction18_Ctrl', 'Fraction18_RNAse','Fraction19_Ctrl', 'Fraction19_RNAse','Fraction20_Ctrl', 'Fraction20_RNAse','Fraction21_Ctrl', 'Fraction21_RNAse','Fraction22_Ctrl', 'Fraction22_RNAse','Fraction23_Ctrl', 'Fraction23_RNAse','Fraction24_Ctrl', 'Fraction24_RNAse','Fraction25_Ctrl', 'Fraction25_RNAse')
meantabellen = c('mean_Fraction1_Ctrl', 'mean_Fraction1_RNAse', 'mean_Fraction2_Ctrl', 'mean_Fraction2_RNAse','mean_Fraction3_Ctrl', 'mean_Fraction3_RNAse','mean_Fraction4_Ctrl', 'mean_Fraction4_RNAse','mean_Fraction5_Ctrl', 'mean_Fraction5_RNAse','mean_Fraction6_Ctrl', 'mean_Fraction6_RNAse','mean_Fraction7_Ctrl', 'mean_Fraction7_RNAse','mean_Fraction8_Ctrl', 'mean_Fraction8_RNAse','mean_Fraction9_Ctrl', 'mean_Fraction9_RNAse','mean_Fraction10_Ctrl', 'mean_Fraction10_RNAse','mean_Fraction11_Ctrl', 'mean_Fraction11_RNAse','mean_Fraction12_Ctrl', 'mean_Fraction12_RNAse','mean_Fraction13_Ctrl', 'mean_Fraction13_RNAse','mean_Fraction14_Ctrl', 'mean_Fraction14_RNAse','mean_Fraction15_Ctrl', 'mean_Fraction15_RNAse','mean_Fraction16_Ctrl', 'mean_Fraction16_RNAse','mean_Fraction17_Ctrl', 'mean_Fraction17_RNAse','mean_Fraction18_Ctrl', 'mean_Fraction18_RNAse','mean_Fraction19_Ctrl', 'mean_Fraction19_RNAse','mean_Fraction20_Ctrl', 'mean_Fraction20_RNAse','mean_Fraction21_Ctrl', 'mean_Fraction21_RNAse','mean_Fraction22_Ctrl', 'mean_Fraction22_RNAse','mean_Fraction23_Ctrl', 'mean_Fraction23_RNAse','mean_Fraction24_Ctrl', 'mean_Fraction24_RNAse','mean_Fraction25_Ctrl', 'mean_Fraction25_RNAse')
Replikatsnamen = c('Ctrl1_norm_mittel','Ctrl2_norm_mittel','Ctrl3_norm_mittel','RNAse1_norm_mittel','RNAse2_norm_mittel','RNAse3_norm_mittel')
SW_Replikatsnamen = c('SW_Ctrl1','SW_Ctrl2','SW_Ctrl3','SW_RNAse1','SW_RNAse2','SW_RNAse3')
SW_Replikatsnamen_norm = c('SW_Ctrl1_norm','SW_Ctrl2_norm','SW_Ctrl3_norm','SW_RNAse1_norm','SW_RNAse2_norm','SW_RNAse3_norm')
Treatmentnamen = c('Ctrl1_norm','Ctrl2_norm','Ctrl3_norm','RNAse1_norm','RNAse2_norm','RNAse3_norm')
Proteinnamen = rownames(MS_Table)
for (i in 1:50) {
spalten_index = ((i-1) * 3 + 1):(i * 3) # Each loop processes 3 columns together
teilmatrix = MS_Table[,spalten_index]
Mittelwerte = colMeans(teilmatrix)
assign(paste0('mean_', Fraktionsnamen[i]), Mittelwerte) # Automatically assigns correct name
}
normf.fraktion = c()
for (i in 1:50) {
# Calculate pairwise differences of the individual means
abstand_12 = abs(get(meantabellen[i])[1] - get(meantabellen[i])[2])
abstand_13 = abs(get(meantabellen[i])[1] - get(meantabellen[i])[3])
abstand_23 = abs(get(meantabellen[i])[2] - get(meantabellen[i])[3])
# Determine the smallest difference and compute the mean of the closest pair
if (abstand_12 <= abstand_13 && abstand_12 <= abstand_23) {
normf.fraktion[i] = mean(c(get(meantabellen[i])[1], get(meantabellen[i])[2]))
} else if (abstand_13 <= abstand_12 && abstand_13 <= abstand_23) {
normf.fraktion[i] = mean(c(get(meantabellen[i])[1], get(meantabellen[i])[3]))
} else {
normf.fraktion[i] = mean(c(get(meantabellen[i])[2], get(meantabellen[i])[3]))
}
}
# Divide the normalization factor per fraction by each replicate’s mean
normf.rep = c()
for (i in 1:50) {
for (j in 1:3) {
normf.rep = c(normf.rep, normf.fraktion[i] / get(meantabellen[i])[j])
}
}
# Apply replicate-specific normalization factors to each corresponding replicate
MS_Table_norm_mittel = as.data.frame(matrix(NA, nrow = nrow(MS_Table), ncol = 150))
for (i in 1:150) {
MS_Table_norm_mittel[,i] = normf.rep[i] * MS_Table[,i]
}
colnames(MS_Table_norm_mittel) = colnames(MS_Table)
rownames(MS_Table_norm_mittel) = rownames(MS_Table)
# Create subtables for individual replicates
for (i in 1:6) {
spalten_index = seq(from = i, to = 150, by = 6)
teilmatrix = MS_Table_norm_mittel[, spalten_index]
assign(Replikatsnamen[i], teilmatrix)
}
# Apply sliding window
for (i in 1:6) {
teilmatrix = data.frame(
get(Replikatsnamen[i])[1],
(get(Replikatsnamen[i])[1:23] + get(Replikatsnamen[i])[2:24] + get(Replikatsnamen[i])[3:25]) / 3,
get(Replikatsnamen[i])[25]
)
assign(SW_Replikatsnamen[i], teilmatrix) # Automatically assigns the correct name
}
for (i in 1:6) {
teilmatrix = (get(SW_Replikatsnamen[i])) * 100 / rowSums(get(SW_Replikatsnamen[i]))
teilmatrix[is.na(teilmatrix)] = 0 # Handles cases where a fraction's total is 0 → prevents division
assign(SW_Replikatsnamen_norm[i], teilmatrix) # Automatically assigns the correct name
}
matrices = list(SW_Ctrl1_norm, SW_Ctrl2_norm, SW_Ctrl3_norm, SW_RNAse1_norm, SW_RNAse2_norm, SW_RNAse3_norm)
MS_Table_norm = c()
for (i in 1:25) {
for (j in 1:6) {
MS_Table_norm = as.data.frame(cbind(MS_Table_norm, matrices[[j]][, i]))
}
}
colnames(MS_Table_norm) = colnames(MS_Table)
rownames(MS_Table_norm) = rownames(MS_Table)
# Calculate mean for each column (raw and normalized)
mean_raw = colMeans(MS_Table, na.rm = TRUE)
mean_norm = colMeans(MS_Table_norm, na.rm = TRUE)
# Function to extract metadata from column names
extract_meta = function(sample_names) {
meta_df = data.frame(Sample = sample_names, stringsAsFactors = FALSE)
meta_df$Fraction = sub("Fraction(\\d+)_.*", "\\1", meta_df$Sample)
meta_df$Fraction = factor(meta_df$Fraction, levels = as.character(1:25))
meta_df$Condition = ifelse(grepl("Ctrl", meta_df$Sample), "Control", "RNase")
meta_df$Replicate = sub(".*_Rep(\\d+)", "\\1", meta_df$Sample)
meta_df$Replicate = factor(meta_df$Replicate, levels = c("1", "2", "3"))
return(meta_df)
}
# Get metadata for raw and normalized data
meta_raw = extract_meta(names(mean_raw))
meta_norm = extract_meta(names(mean_norm))
# Create data frames for raw and normalized means
df_raw = data.frame(Sample = names(mean_raw), Value = mean_raw, Normalization = "Before") %>%
left_join(meta_raw, by = "Sample")
df_norm = data.frame(Sample = names(mean_norm), Value = mean_norm, Normalization = "After") %>%
left_join(meta_norm, by = "Sample")
# Combine both data frames
df_all = bind_rows(df_raw, df_norm)
# Percent scale (normalized data is already percent, raw data scaled to percent for comparison)
df_all = df_all %>%
group_by(Normalization, Condition, Replicate) %>%
mutate(Value_scaled = ifelse(Normalization == "Before", Value / max(Value, na.rm = TRUE) * 100, Value)) %>%
ungroup()
# Set factor levels to ensure 'Before' is on the left, 'After' on the right
df_all$Normalization = factor(df_all$Normalization, levels = c("Before", "After"))
# Plot with replicate distinction using line type and point shape
Normalization_plot = ggplot(df_all, aes(x = Fraction, y = Value_scaled, color = Condition, group = interaction(Condition, Replicate))) +
geom_line(aes(linetype = Replicate), alpha = 0.7) +
geom_point(aes(shape = Replicate), size = 2) +
facet_wrap(~Normalization, ncol = 2, scales = "free_y", strip.position = "top") + # Title on top
theme_poster +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
strip.placement = "outside",
strip.background = element_blank(),
legend.position = "bottom"
) +
scale_color_manual(values = treatment_colors) +
scale_linetype_manual(values = c("1" = "solid", "2" = "solid", "3" = "solid")) +
scale_shape_manual(values = c(16, 17, 15)) +
labs(
title = "Mean Intensities by Fraction (Raw vs. Normalized)",
x = "Fraction",
y = "Mean Intensity (%)",
color = "Condition",
linetype = "Replicate",
shape = "Replicate"
)
# Save plot as PNG
ggsave(filename = "Postervisuals/Normalization_plot.png", plot = Normalization_plot, width = 16.5, height = 10.2, units = "cm", dpi = 300)
print(Normalization_plot)
Reproducibility across biological replicates ensures that the observed protein distribution patterns are reliable and not due to technical noise. High reproducibility is critical for downstream analyses like peak detection and condition comparisons. It acts as a quality control step to confirm that our experimental setup yields consistent results.
We use Pearson correlation because it measures linear similarity between replicates, which fits our expectation for consistent protein profiles. Unlike Spearman, Pearson is less affected by the many zero values in our data and gives more stable, meaningful results for our type of intensity measurements.
# Vector for correlations between columns 1 & 2
kor_Spalte1_Spalte2 = c()
for (i in 1:50) {
spalten_index = ((i-1) * 3 + 1):(i * 3) # every loop: 3 columns grouped
teilmatrix = MS_Table_norm[,spalten_index] # submatrix for each fraction
kor_a = cor(teilmatrix[,1], teilmatrix[,2], method = 'pearson') # Pearson correlation between column 1 & 2
kor_Spalte1_Spalte2 = c(kor_Spalte1_Spalte2, kor_a) # append to vector
}
# Vector for correlations between columns 2 & 3
kor_Spalte2_Spalte3 = c()
for (i in 1:50) {
spalten_index = ((i-1) * 3 + 1):(i * 3) # group 3 columns per loop
teilmatrix = MS_Table_norm[,spalten_index] # submatrix for each fraction
kor_b = cor(teilmatrix[,2], teilmatrix[,3], method = 'pearson') # Pearson correlation between column 2 & 3
kor_Spalte2_Spalte3 = c(kor_Spalte2_Spalte3, kor_b) # append to vector
}
# Vector for correlations between columns 1 & 3
kor_Spalte1_Spalte3 = c()
for (i in 1:50) {
spalten_index = ((i-1) * 3 + 1):(i * 3) # group 3 columns per loop
teilmatrix = MS_Table_norm[,spalten_index] # submatrix for each fraction
kor_c = cor(teilmatrix[,1], teilmatrix[,3], method = 'pearson') # Pearson correlation between column 1 & 3
kor_Spalte1_Spalte3 = c(kor_Spalte1_Spalte3, kor_c) # append to vector
}
kor_tabelle = cbind(kor_Spalte1_Spalte2, kor_Spalte2_Spalte3, kor_Spalte1_Spalte3)
rownames(kor_tabelle) = Fraktionsnamen # assign fraction names as row names
kor_tabelle
## kor_Spalte1_Spalte2 kor_Spalte2_Spalte3 kor_Spalte1_Spalte3
## Fraction1_Ctrl 0.9922913 0.9884693 0.9793272
## Fraction1_RNAse 0.9964996 0.9972000 0.9953063
## Fraction2_Ctrl 0.9982441 0.9973272 0.9961067
## Fraction2_RNAse 0.9974460 0.9978123 0.9979283
## Fraction3_Ctrl 0.9980140 0.9957821 0.9947088
## Fraction3_RNAse 0.9964020 0.9967907 0.9960179
## Fraction4_Ctrl 0.9976834 0.9946320 0.9942809
## Fraction4_RNAse 0.9959041 0.9953702 0.9957126
## Fraction5_Ctrl 0.9977046 0.9969456 0.9957746
## Fraction5_RNAse 0.9960425 0.9942320 0.9951696
## Fraction6_Ctrl 0.9971365 0.9976543 0.9957437
## Fraction6_RNAse 0.9946160 0.9933350 0.9936183
## Fraction7_Ctrl 0.9957639 0.9957994 0.9925912
## Fraction7_RNAse 0.9938465 0.9888819 0.9883642
## Fraction8_Ctrl 0.9960893 0.9958751 0.9916406
## Fraction8_RNAse 0.9938301 0.9906699 0.9920220
## Fraction9_Ctrl 0.9938336 0.9956706 0.9885944
## Fraction9_RNAse 0.9935125 0.9913715 0.9903113
## Fraction10_Ctrl 0.9966801 0.9968259 0.9923447
## Fraction10_RNAse 0.9959630 0.9943337 0.9927875
## Fraction11_Ctrl 0.9948911 0.9955383 0.9874038
## Fraction11_RNAse 0.9951868 0.9939271 0.9939221
## Fraction12_Ctrl 0.9889320 0.9946964 0.9804144
## Fraction12_RNAse 0.9946670 0.9939933 0.9946543
## Fraction13_Ctrl 0.9899478 0.9955592 0.9840633
## Fraction13_RNAse 0.9945411 0.9932438 0.9941475
## Fraction14_Ctrl 0.9948072 0.9963415 0.9897479
## Fraction14_RNAse 0.9967785 0.9952474 0.9968806
## Fraction15_Ctrl 0.9983127 0.9969126 0.9947427
## Fraction15_RNAse 0.9966497 0.9960483 0.9972090
## Fraction16_Ctrl 0.9976604 0.9964836 0.9950395
## Fraction16_RNAse 0.9952361 0.9948375 0.9957301
## Fraction17_Ctrl 0.9956301 0.9954071 0.9927168
## Fraction17_RNAse 0.9938339 0.9938537 0.9919904
## Fraction18_Ctrl 0.9949503 0.9949853 0.9882183
## Fraction18_RNAse 0.9893265 0.9920238 0.9913685
## Fraction19_Ctrl 0.9969580 0.9913472 0.9871498
## Fraction19_RNAse 0.9872632 0.9905200 0.9909264
## Fraction20_Ctrl 0.9955724 0.9919665 0.9888960
## Fraction20_RNAse 0.9868062 0.9824749 0.9853667
## Fraction21_Ctrl 0.9968258 0.9966845 0.9939542
## Fraction21_RNAse 0.9833158 0.9847143 0.9702886
## Fraction22_Ctrl 0.9980331 0.9975115 0.9959636
## Fraction22_RNAse 0.9853036 0.9872573 0.9734973
## Fraction23_Ctrl 0.9971794 0.9965572 0.9957957
## Fraction23_RNAse 0.9727797 0.9773317 0.9746187
## Fraction24_Ctrl 0.9965897 0.9944233 0.9953608
## Fraction24_RNAse 0.9692828 0.9713144 0.9873473
## Fraction25_Ctrl 0.9971594 0.9813583 0.9808523
## Fraction25_RNAse 0.9450400 0.9626007 0.9750045
kor_long = data.frame(
Fraction = rep(Fraktionsnamen, 3),
Correlation = c(kor_Spalte1_Spalte2, kor_Spalte2_Spalte3, kor_Spalte1_Spalte3),
Pair = rep(c("Replicates 1&2", "Replicates 2&3", "Replicates 1&3"), each = 50)
)
Boxplot_Correlations = ggplot(kor_long, aes(x = Pair, y = Correlation, fill = Pair)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Reproducibility of Replicates (Pearson Correlation)",
x = "Replicate Pair",
y = "Correlation Coefficient") +
theme(legend.position = "none")
Violinplot_Correlations = ggplot(kor_long, aes(x = Pair, y = Correlation, fill = Pair)) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, fill = "white") +
scale_fill_manual(values = c("#7E51A4", "#485571", "#504DA4")) +
theme_poster +
labs(
title = "Reproducibility of Replicates",
y = "Correlation Coefficient",
x = NULL # Remove X-axis title but keep labels
) +
theme(
axis.title.y = element_text(margin = margin(r = 15)), # space on Y-axis title
axis.title.x = element_blank(), # no X-axis title
axis.ticks.x = element_blank(), # no ticks on X-axis
axis.line.x = element_blank(), # no bottom axis line
legend.position = "none"
)
# Save plot as PNG
ggsave(filename = "Postervisuals/Violinplot_Correlations.png", plot = Violinplot_Correlations, width = 13.5, height = 9, units = "cm", dpi = 300)
print(Violinplot_Correlations)
# 1. Calculate median, ignore NAs
median_r = median(kor_tabelle, na.rm = TRUE)
# 2. Matrix of absolute differences to median
diffs = abs(kor_tabelle - median_r)
# 3. Index of minimal difference (NAs automatically ignored)
min_index = which(diffs == min(diffs, na.rm = TRUE), arr.ind = TRUE)[1, , drop = FALSE]
# 4. Extract fraction and pair (fallback if names missing)
median_fraction = rownames(kor_tabelle)[min_index[1]] %||% min_index[1]
median_pair = colnames(kor_tabelle)[min_index[2]] %||% min_index[2]
# 5. Output results
cat("Median correlation value:", round(median_r, 3), "\n")
## Median correlation value: 0.995
cat("Plot for fraction:", median_fraction, "| pair:", median_pair, "\n")
## Plot for fraction: Fraction12_RNAse | pair: kor_Spalte1_Spalte2
plot_correlation = function(data, col1, col2, color = "", save_path = "") {
# Filter complete cases
df = data %>%
select(all_of(c(col1, col2))) %>%
mutate(Protein = rownames(data)) %>%
pivot_longer(-Protein, names_to = "Sample", values_to = "Value") %>%
pivot_wider(names_from = Sample, values_from = Value) %>%
filter(complete.cases(.))
# Calculate correlation
r = cor(df[[col1]], df[[col2]])
r_text = paste0("R = ", round(r, 2))
# Create ggplot
p = ggplot(df, aes_string(x = col1, y = col2)) +
geom_point(alpha = 0.6, color = color) +
geom_smooth(method = "lm", se = FALSE, color = "#7E51A4") +
labs(title = paste("Scatterplot:", col1, "vs", col2),
x = col1,
y = col2) +
theme_poster +
annotate("text", x = min(df[[col1]], na.rm = TRUE),
y = max(df[[col2]], na.rm = TRUE),
label = r_text, hjust = 0, vjust = 1, size = 10, color = color)
# Show plot
print(p)
# Save plot optionally
if (save_path != "") {
ggsave(filename = save_path, plot = p, width = 8, height = 8)
}
}
# plot for median r-value:
plot_correlation(MS_Table_norm, "Fraction12_RNase_Rep1", "Fraction12_RNase_Rep2", color = "black", save_path = "Postervisuals/vergleich_plot_ggplot.png")
Now we have the normalized values for all proteins and can examine our dataset in detail. Our goal is to identify differences between the control and RNase samples. To do this, we first determine both absolute and local maxima for all proteins. To take it a step further, we want to create our own function that can assign each protein to a shift category based on criteria we define ourselves. So now, let’s take a closer look at our proteins.
for (i in 1:6) {
spalten_index = seq(from = i, to = 150, by = 6)
teilmatrix = MS_Table_norm[, spalten_index]
assign(Treatmentnamen[i], teilmatrix) # New subtables named correctly
}
# Mean values from the replicates
Mittelwert_matrix = t(apply(MS_Table_norm, 1, function(row) {
sapply(1:50, function(i) {
spalten_index = ((i - 1) * 3 + 1):(i * 3)
mean(row[spalten_index])
})
}))
# Creating two mean subtables for each treatment
Mittelwerte_Kontrolle = as.data.frame(Mittelwert_matrix[, seq(1, 50, by=2)])
Mittelwerte_RNAse = as.data.frame(Mittelwert_matrix[, seq(2, 50, by=2)])
# Naming sub-tables correctly
colnames(Mittelwerte_Kontrolle) = paste0(Fraktionsnamen[seq(1, 50, by=2)])
colnames(Mittelwerte_RNAse) = paste0(Fraktionsnamen[seq(1, 50, by=2)])
zeile = 'ACOT9_HUMAN'
# Kombiniere y-Werte beider Bedingungen
alle_werte = c(Mittelwerte_Kontrolle[zeile,], Mittelwerte_RNAse[zeile,])
ylim_bereich = range(alle_werte, na.rm = TRUE)
# Mehr Platz für die Legende
par(mar = c(5, 4, 6, 2))
# Plot für Kontrolle
plot(1:25, Mittelwerte_Kontrolle[zeile,], type = "l", col = treatment_colors["Control"],
xlab = "Fraction", ylab = "Intensity in %", main = "Protein distribution",
ylim = ylim_bereich)
# Plot für RNAse
lines(1:25, Mittelwerte_RNAse[zeile,], col = treatment_colors["RNase"])
# Legende einfügen
legend("topright", inset = c(0, 0.02),
legend = names(treatment_colors),
col = treatment_colors,
lty = 1, cex = 0.6,
bty = "n")
Now we determine the maximum values using a function that we can then apply to all proteins. The threshold value is set to 2 and, in addition to the local and absolute maxima, plateaus and edge maxima are also taken into account.
#Set up function and set threshold value to 2
find_maxima = function(mat, threshold = 2) {
result = do.call(rbind, lapply(1:nrow(mat), function(i) {
x = mat[i, ]
# normal Maxima: look at the first numerical derivative, if there is a sign change it is a maximum
idx = which(diff(sign(diff(x))) == -2) + 1
# Find plateau maxima: each plateau is checked for a local maximum, if the plateau value is greater than or equal to its direct neighbors, the average value of the plateau is calculated and saved
rle_x = rle(x) # Run-Length-Encoding
lengths = rle_x$lengths
values = rle_x$values
ends = cumsum(lengths)
starts = ends - lengths + 1
plateau_idx = c()
for (j in seq_along(values)) {
# only check Plateaus with lengths > 1
if (lengths[j] > 1) {
start_pos = starts[j]
end_pos = ends[j]
# values right and left of the plateau
left_val = if (start_pos > 1) x[start_pos - 1] else -Inf
right_val = if (end_pos < length(x)) x[end_pos + 1] else -Inf
# Plateau is a lokal Maximum, if it´s value >= neighbor
if (values[j] >= left_val && values[j] >= right_val) {
plateau_idx = c(plateau_idx, round(mean(start_pos:end_pos)))
}
}
}
# Combine normal maxima and plateau maxima
all_idx = sort(unique(c(idx, plateau_idx))) #unique: if values are duplicated they are removed, sort: sort the indices in ascending order
# check edge maxima
if (x[1] > x[2]) {all_idx = c(1, all_idx)}
if (x[length(x)] > x[length(x)-1]) {all_idx = c(all_idx, length(x))}
# only keep values > threshold = 2%
all_idx = all_idx[x[all_idx] > threshold]
# Clean up peaks: If two peaks have only one value between them, remove the smaller one
if (length(all_idx) >= 2) {
to_remove = c()
for (k in 1:(length(all_idx)-1)) {
if (all_idx[k+1] - all_idx[k] == 2) {
# Indices of the peaks
peak1 = all_idx[k]
peak2 = all_idx[k+1]
if (x[peak1] >= x[peak2]) {
to_remove = c(to_remove, peak2)
} else {
to_remove = c(to_remove, peak1)
}
}
} # remove the lower value
if (length(to_remove) > 0) {
all_idx = setdiff(all_idx, to_remove)
}
}
if (length(all_idx) > 0) {
data.frame(
Protein = Proteinnamen[i],
Zeile = i,
Fraktion = all_idx,
Wert = x[all_idx]
)
} else {
NULL
}
}))
rownames(result) = NULL
return(result)
}
# Apply maxima function to our mean value tables for both treatments
Mittelwerte_Kontrolle_mat = as.matrix(sapply(Mittelwerte_Kontrolle, as.numeric))
Mittelwerte_RNAse_mat = as.matrix(sapply(Mittelwerte_RNAse, as.numeric))
Maxima_Mittelwerte_Kontrolle = find_maxima(Mittelwerte_Kontrolle_mat)
Maxima_Mittelwerte_RNAse = find_maxima(Mittelwerte_RNAse_mat)
#Generate two sub-tables for absolute maxima for control and RNAse
abs_Maxima_Mittelwerte_Kontrolle = Maxima_Mittelwerte_Kontrolle %>%
group_by(Protein) %>% # Grouping by protein
slice_max(order_by = Wert, n = 1) %>% # Keep line with maximum maxima
ungroup()
abs_Maxima_Mittelwerte_RNAse = Maxima_Mittelwerte_RNAse %>%
group_by(Protein) %>% # Grouping by protein
slice_max(order_by = Wert, n = 1) %>% # Keep line with maximum maxima
ungroup()
Groups of adjacent fractions (at least 4 fractions next to each other) were sought that were not identified as maxima, but still showed a large amount of protein.
# Create a function for finding the shoulder points
find_shoulders = function(binary_mat, Proteinnamen) {
result = do.call(rbind, lapply(1:nrow(binary_mat), function(i) {
x = binary_mat[i, ]
rle_x = rle(x)
lengths = rle_x$lengths
values = rle_x$values
ends = cumsum(lengths)
starts = ends - lengths + 1
schulter_idx = c()
for (j in seq_along(values)) {
if (values[j] == 1 && lengths[j] >= 4) {
start_pos = starts[j]
end_pos = ends[j]
schulter_idx = c(schulter_idx, round(mean(start_pos:end_pos)))
}
}
if (length(schulter_idx) > 0) {
data.frame(
Protein = Proteinnamen[i],
Zeile = i,
Fraktion = schulter_idx
)
} else {
NULL
}
}))
rownames(result) = NULL
return(result)
}
#Temporary removal of protein quantity that is below threshold
Schulter_Mittelwerte_Kontrolle = ifelse(Mittelwerte_Kontrolle > 2, 1, 0) # if greater than 2 then it becomes 1 otherwise the value is set to 0
# Temporary removal of areas around maxima --> are set to 0: within a radius of 3
for (i in 1:nrow(Maxima_Mittelwerte_Kontrolle)) {
Protein = Maxima_Mittelwerte_Kontrolle$Zeile[i] # row (Protein)
Fraktion = Maxima_Mittelwerte_Kontrolle$Fraktion[i] # column (Fraction)
# Calculate range limits for the columns (check limits!)
#max(1, fraction-3) ensures that the start index is at least 1 (not <1)
#min(ncol(...), fraction+3) ensures that the final index is at most the total number of columns (ncol(...))
Fraktions_bereich = max(1, Fraktion - 3):min(ncol(Schulter_Mittelwerte_Kontrolle), Fraktion + 3)
# Set the ±3 columns to 0 only in this rows
Schulter_Mittelwerte_Kontrolle[Protein, Fraktions_bereich] = 0
}
#Apply function for shoulder points
Schulterpunkte_Kontrolle = find_shoulders(Schulter_Mittelwerte_Kontrolle, Proteinnamen)
#Temporary removal of protein quantity that is below threshold
Schulter_Mittelwerte_RNAse = ifelse(Mittelwerte_RNAse > 2, 1, 0) # if greater than 2 then it becomes 1 otherwise the value is set to 0
# Temporary removal of areas around maxima --> are set to 0: within a radius of 3
for (i in 1:nrow(Maxima_Mittelwerte_RNAse)) {
Protein = Maxima_Mittelwerte_RNAse$Zeile[i] # row (Protein)
Fraktion = Maxima_Mittelwerte_RNAse$Fraktion[i] # column (Fraktion)
#Calculate range limits for the columns (check limits!)
#max(1, fraction-3) ensures that the start index is at least 1 (not <1)
#min(ncol(...), fraction+3) ensures that the final index is at most the total number of columns (ncol(...))
Fraktions_bereich = max(1, Fraktion - 3):min(ncol(Schulter_Mittelwerte_RNAse), Fraktion + 3)
# Set the ±3 columns to 0 only in this rows
Schulter_Mittelwerte_RNAse[Protein, Fraktions_bereich] = 0
}
#Apply function for shoulder points
Schulterpunkte_RNAse = find_shoulders(Schulter_Mittelwerte_RNAse, Proteinnamen)
# Add values for shoulder points
Schulterpunkte_Kontrolle$Fraktion_gerundet = round(Schulterpunkte_Kontrolle$Fraktion)
Schulterpunkte_Kontrolle$Wert = mapply(function(zeile, fraktion) {
Mittelwerte_Kontrolle[zeile, fraktion]
}, Schulterpunkte_Kontrolle$Zeile, Schulterpunkte_Kontrolle$Fraktion_gerundet)
Schulterpunkte_RNAse$Fraktion_gerundet = round(Schulterpunkte_RNAse$Fraktion)
Schulterpunkte_RNAse$Wert = mapply(function(zeile, fraktion) {
Mittelwerte_RNAse[zeile, fraktion]
}, Schulterpunkte_RNAse$Zeile, Schulterpunkte_RNAse$Fraktion_gerundet)
# Merging maxima values and shoulder points for controls
Maxima_Mittelwerte_Kontrolle$Typ = "Maxima"
Schulterpunkte_Kontrolle$Typ = "Schulter"
Combined_Kontrolle = rbind(
Maxima_Mittelwerte_Kontrolle[, c("Protein", "Zeile", "Fraktion", "Wert", "Typ")],
Schulterpunkte_Kontrolle[, c("Protein", "Zeile", "Fraktion", "Wert", "Typ")]
)
Combined_Kontrolle = Combined_Kontrolle[order(Combined_Kontrolle$Zeile, Combined_Kontrolle$Fraktion), ]
# Merging maxima values and shoulder points for RNAse
Maxima_Mittelwerte_RNAse$Typ = "Maxima"
Schulterpunkte_RNAse$Typ = "Schulter"
Combined_RNAse = rbind(
Maxima_Mittelwerte_RNAse[, c("Protein", "Zeile", "Fraktion", "Wert", "Typ")],
Schulterpunkte_RNAse[, c("Protein", "Zeile", "Fraktion", "Wert", "Typ")]
)
Combined_RNAse = Combined_RNAse[order(Combined_RNAse$Zeile, Combined_RNAse$Fraktion), ]
… between control and RNAse maxima as the first criterion with which we later want to assign a shift category to the proteins
# Rename column names of sub-tables to be able to access them via Combined_Kontrolle and Combined_RNAse
Ctrl1_frakt = Ctrl1_norm
colnames(Ctrl1_frakt) = c(1:25)
Ctrl2_frakt = Ctrl2_norm
colnames(Ctrl2_frakt) = c(1:25)
Ctrl3_frakt = Ctrl3_norm
colnames(Ctrl3_frakt) = c(1:25)
RNAse1_frakt = RNAse1_norm
colnames (RNAse1_frakt) = c(1:25)
RNAse2_frakt = RNAse2_norm
colnames(RNAse2_frakt) = c(1:25)
RNAse3_frakt = RNAse3_norm
colnames(RNAse3_frakt) = c(1:25)
# form Lists from the replicates of control and RNAse
ctrl_reps = list(Ctrl1_frakt, Ctrl2_frakt, Ctrl3_frakt)
rnase_reps = list(RNAse1_frakt, RNAse2_frakt, RNAse3_frakt)
# Dataframe of control and RNAse with only the positions of the peaks
peaks_ctrl = data.frame(Protein = Combined_Kontrolle$Protein, Fraktion = Combined_Kontrolle$Fraktion, Treatment = 'Kontrolle')
peaks_rnase = data.frame(Protein = Combined_RNAse$Protein, Fraktion = Combined_RNAse$Fraktion, Treatment = 'RNase')
# determine same fraction combinations of control and RNAse for the maxima
shared_peaks = merge(peaks_ctrl, peaks_rnase, by = c("Protein", "Fraktion"))
### Determine P values for the various maxima
# Set up dataframe for the results
p_Werte = data.frame()
for (i in 1:nrow(shared_peaks)) {
protein = shared_peaks$Protein[i]
fraktion = as.character(shared_peaks$Fraktion[i])
# Extract control values: via position in shared peaks
ctrl_vals = sapply(ctrl_reps, function(df) {
if (protein %in% rownames(df) && fraktion %in% colnames(df)) {
return(df[protein, fraktion])
} else {
return(NA)
}
})
# Extract RNAse values: via position in shared peaks
rnase_vals = sapply(rnase_reps, function(df) {
if (protein %in% rownames(df) && fraktion %in% colnames(df)) {
return(df[protein, fraktion])
} else {
return(NA)
}
})
# remove NA values
ctrl_vals = na.omit(ctrl_vals)
rnase_vals = na.omit(rnase_vals)
# If necessary: Add mini fault with identical values, otherwise the variance = 0 and cannot be used for the t-test
if (length(ctrl_vals) >= 2 && all(abs(ctrl_vals - ctrl_vals[1]) < 1e-10)) {
ctrl_vals[2] = ctrl_vals[2] + 1e-6
}
if (length(rnase_vals) >= 2 && all(abs(rnase_vals - rnase_vals[1]) < 1e-10)) {
rnase_vals[2] = rnase_vals[2] + 1e-6
}
#t-test only if both groups have at least 2 values
# Assumption for the t-test that replicates are normally distributed in order to be able to perform the t-test at all
# Variance test with F-test to see whether control and RNAse have the same variance: important for t-test implementation
if (length(ctrl_vals) >= 2 && length(rnase_vals) >= 2) {
# Paired t-test
t_p = tryCatch(
t.test(ctrl_vals, rnase_vals, paired = TRUE)$p.value,
error = function(e) NA
)
} else {
t_p = NA
}
p_Werte = rbind(p_Werte, data.frame(Protein = protein, Fraktion = fraktion, P_Value = t_p))
}
# FDR-Correction: adjust all p-values in your overall dataset to control the false discovery rate.
p_Werte$adj_p = p.adjust(p_Werte$P_Value, method = "BH")
The following criteria are examined:
#extract only the columns “Fraction” and “Value” and split this 2-column data based on the grouping variable Combined_Control$Protein.Result: a list in which each element corresponds to a protein ID.
liste_maxima_Ctrl = split(Combined_Kontrolle[, c("Fraktion", "Wert")], Combined_Kontrolle$Protein)
liste_maxima_RNAse = split(Combined_RNAse[, c("Fraktion", "Wert")], Combined_RNAse$Protein)
#define the function
hol_shifts = function(protein, threshold_rel = 0.3) {
window = 2 # Size of the fraction window
# get the maximum values of the current protein
ctrl_maxima = liste_maxima_Ctrl[[protein]]
rnase_maxima = liste_maxima_RNAse[[protein]]
#If the protein is not present in one of the lists, the value is assigned the number zero
if (is.null(ctrl_maxima)) ctrl_maxima = data.frame(Fraktion = numeric(0), Wert = numeric(0))
if (is.null(rnase_maxima)) rnase_maxima = data.frame(Fraktion = numeric(0), Wert = numeric(0))
#if a value exists, the maximum of the values of this protein is calculated. Then all fractions whose value is at least over threshold are filtered. These are considered as “maxima”
if (nrow(ctrl_maxima) > 0) {
max_ctrl = max(ctrl_maxima$Wert)
ctrl_maxima = ctrl_maxima[ctrl_maxima$Wert >= threshold_rel * max_ctrl, ]
}
if (nrow(rnase_maxima) > 0) {
max_rnase = max(rnase_maxima$Wert)
rnase_maxima = rnase_maxima[rnase_maxima$Wert >= threshold_rel * max_rnase, ]
}
#Counting the maxima found
nb_ctrl_maxima = nrow(ctrl_maxima)
nb_rnase_maxima = nrow(rnase_maxima)
# Distances between the maxima
abstände = c()
for (i in 1:nb_ctrl_maxima) {
for (j in 1:nb_rnase_maxima) {
abstand = rnase_maxima$Fraktion[j] - ctrl_maxima$Fraktion[i]
abstände = c(abstände, abstand)
}
}
# Function for area calculation with trapezoidal rule to obtain protein quantity
calc_area = function(df_list, protein, center_frac, window) {
sapply(df_list, function(df) {
if (protein %in% rownames(df)) {
fracs = as.numeric(colnames(df))
idxs = which(fracs >= (center_frac - window) & fracs <= (center_frac + window))
if (length(idxs) >= 2) {
x_vals = fracs[idxs]
y_vals = as.numeric(df[protein, idxs])
y_vals = na.omit(y_vals)
if (length(y_vals) == length(x_vals) && length(y_vals) >= 2) {
return(trapz(x_vals, y_vals))
}
}
}
return(NA)
})
}
# Amplitude loss and area under control peaks
loss_list = c()
flaeche_kontrolle = c()
for (i in 1:nb_ctrl_maxima) {
fraktion_i = as.character(ctrl_maxima$Fraktion[i])
amp_ctrl = ctrl_maxima$Wert[i]
amp_rnase = Combined_RNAse$Wert[Combined_RNAse$Protein == protein & Combined_RNAse$Fraktion == fraktion_i]
if (length(amp_rnase) == 0) amp_rnase = 0
loss_list = c(loss_list, amp_ctrl - amp_rnase)
flächen_vals = calc_area(ctrl_reps, protein, as.numeric(fraktion_i), window)
flaeche_kontrolle = c(flaeche_kontrolle, mean(na.omit(flächen_vals)))
}
# Amplitude gain and area under RNAse peaks
gain_list = c()
flaeche_rnase = c()
for (i in 1:nb_rnase_maxima) {
fraktion_i = as.character(rnase_maxima$Fraktion[i])
amp_rnase = rnase_maxima$Wert[i]
amp_ctrl = Combined_Kontrolle$Wert[Combined_Kontrolle$Protein == protein & Combined_Kontrolle$Fraktion == fraktion_i]
if (length(amp_ctrl) == 0) amp_ctrl = 0
gain_list = c(gain_list, amp_rnase - amp_ctrl)
flächen_vals = calc_area(rnase_reps, protein, as.numeric(fraktion_i), window)
flaeche_rnase = c(flaeche_rnase, mean(na.omit(flächen_vals)))
}
# p-values for peaks from previously calculated p_values
p_values = sapply(ctrl_maxima$Fraktion, function(f) {
p_val = p_Werte$adj_p[p_Werte$Protein == protein & p_Werte$Fraktion == f]
if (length(p_val) == 0) NA else p_val[1]
})
return(list(
Protein = protein,
Anzahl_Kontroll_Maxima = nb_ctrl_maxima,
Anzahl_RNase_Maxima = nb_rnase_maxima,
Abstände_Maxima = abstände,
Shift_idx = sum(abstände),
Verlust_Amplituden = loss_list,
Gewinn_Amplituden = gain_list,
Summierter_Verlust = sum(loss_list),
Summierter_Gewinn = sum(gain_list),
Fläche_Kontrolle = flaeche_kontrolle,
Fläche_RNase = flaeche_rnase,
p_Werte = p_values
))
}
# Examples values for the funciton for the used protein
hol_shifts(zeile)
## $Protein
## [1] "ACOT9_HUMAN"
##
## $Anzahl_Kontroll_Maxima
## [1] 2
##
## $Anzahl_RNase_Maxima
## [1] 1
##
## $Abstände_Maxima
## [1] 0 -16
##
## $Shift_idx
## [1] -16
##
## $Verlust_Amplituden
## [1] -17.04502 17.00204
##
## $Gewinn_Amplituden
## [1] 17.04502
##
## $Summierter_Verlust
## [1] -0.0429812
##
## $Summierter_Gewinn
## [1] 17.04502
##
## $Fläche_Kontrolle
## [1] 24.55207 49.75274
##
## $Fläche_RNase
## [1] 74.67149
##
## $p_Werte
## [1] 0.02922084 NA
proteine = unique(c(names(liste_maxima_Ctrl), names(liste_maxima_RNAse)))
ergebnisse = lapply(proteine, function(p) {
res = hol_shifts(p)
data.frame(
Protein = res$Protein,
Anzahl_Kontroll_Maxima = res$Anzahl_Kontroll_Maxima,
Anzahl_RNase_Maxima = res$Anzahl_RNase_Maxima,
Shift_idx = res$Shift_idx,
Summierter_Verlust = res$Summierter_Verlust,
Summierter_Gewinn = res$Summierter_Gewinn,
Abstaende_Maxima = paste(res$Abstände_Maxima, collapse = ";"),
Verlust_Amplituden = paste(res$Verlust_Amplituden, collapse = ";"),
Gewinn_Amplituden = paste(res$Gewinn_Amplituden, collapse = ";"),
Fläche_Kontrolle = paste(res$Fläche_Kontrolle,collapse = ";"),
Fläche_RNase = paste(res$Fläche_RNase,collapse = ";"),
p_Werte = paste(res$p_Werte, collapse = ";")
)
})
df_ergebnisse = do.call(rbind, ergebnisse)
#Remove proteins
df_ergebnisse = df_ergebnisse[df_ergebnisse$Protein != 'P210L_HUMAN',]
df_ergebnisse = df_ergebnisse[df_ergebnisse$Protein != 'PKD1_HUMAN',]
df_ergebnisse = df_ergebnisse[df_ergebnisse$Protein != 'TGM7_HUMAN',]
df_ergebnisse = df_ergebnisse[df_ergebnisse$Protein != 'KIF1A_HUMAN',]
For each protein, we examined how many criteria it fulfilled and determined an individual shift score for each protein. Based on this score, the proteins can now be divided into the categories: “strong”, “moderate” or “no”-shift.
#create a function for calculation of the score
klassifiziere_protein_shift_streng_v2 = function(row) {
parse_numeric_vector = function(x) {
if (is.na(x) || is.null(x) || x == "") return(numeric(0))
suppressWarnings(as.numeric(unlist(strsplit(as.character(x), ";"))))
}
get_centroid = function(x) {
if (length(x) == 0 || all(is.na(x)) || all(x == 0)) return(NA_real_)
weighted.mean(seq_along(x), x, na.rm = TRUE)
}
score = 0
richtung = NA
### 1. category Shift_idx (max 2 points for score)
shift_idx = suppressWarnings(as.numeric(row[["Shift_idx"]]))
if (!is.na(shift_idx)) {
if (abs(shift_idx) >= 2) {
score = score + 2
# Shift_idx > 0 --> right
# Shift_idx < 0 --> left
richtung = ifelse(shift_idx > 0, "right", "left")
} else if (abs(shift_idx) == 1) {
score = score + 1
richtung = ifelse(shift_idx > 0, "right", "left")
}
}
### 2) Difference in number of maximum between RNAse and control (max 2 points for score)
max_kontrolle = suppressWarnings(as.numeric(row[["Anzahl_Kontroll_Maxima"]]))
max_rnase = suppressWarnings(as.numeric(row[["Anzahl_RNase_Maxima"]]))
if (!is.na(max_kontrolle) && !is.na(max_rnase)) {
diff_maxima = max_rnase - max_kontrolle
abs_diff = abs(diff_maxima)
if (abs_diff >= 2) {
score = score + 2
if (is.na(richtung)) richtung = ifelse(diff_maxima > 0, "right", "left")
} else if (abs_diff == 1) {
score = score + 1
if (is.na(richtung)) richtung = ifelse(diff_maxima > 0, "right", "left")
}
}
### 3) p-value (Significance gives plus points for the score)
p_vals = parse_numeric_vector(row[["p_Werte"]])
p_vals_vorhanden = length(p_vals) > 0 && !all(is.na(p_vals))
signifikant = p_vals_vorhanden && any(p_vals < 0.05, na.rm = TRUE)
if (signifikant) {
score = score + 2
} else if (!p_vals_vorhanden) {
score = score + 1
if (is.na(richtung)) richtung = "right"
}
### 4) Difference in peak areas between RNAse and control
flaeche_k = parse_numeric_vector(row[["Fläche_Kontrolle"]])
flaeche_r = parse_numeric_vector(row[["Fläche_RNase"]])
delta_flaeche = sum(flaeche_r, na.rm = TRUE) - sum(flaeche_k, na.rm = TRUE)
if (!is.na(delta_flaeche)) {
if (abs(delta_flaeche) > 30) {
score = score + 2
if (is.na(richtung)) richtung = ifelse(delta_flaeche > 0, "right", "left")
} else if (abs(delta_flaeche) > 15) {
score = score + 1
if (is.na(richtung)) richtung = ifelse(delta_flaeche > 0, "right", "left")
}
# special case: shift_idx == 0 but big difference in Area difference under paek --> Score + 1
if (!is.na(shift_idx) && shift_idx == 0 && abs(delta_flaeche) > 30) {
score = score + 1
}
# Centroid comparison (center of peak areas) between RNAse and control
zentroid_k = get_centroid(flaeche_k)
zentroid_r = get_centroid(flaeche_r)
if (!is.na(zentroid_k) && !is.na(zentroid_r)) {
centroid_diff = zentroid_r - zentroid_k
if (abs(centroid_diff) >= 2) {
score = score + 1
if (is.na(richtung)) richtung = ifelse(centroid_diff > 0, "right", "left")
}
}
}
### 5) Compare summed loss and gain of RNase and control amplitudes, but only if p-value of amplitude intervals is significant
if (signifikant) {
sum_gewinn = suppressWarnings(as.numeric(row[["Summierter_Gewinn"]]))
sum_verlust = suppressWarnings(as.numeric(row[["Summierter_Verlust"]]))
if (!is.na(sum_gewinn) && !is.na(sum_verlust)) {
diff_sum = sum_gewinn - sum_verlust
if (abs(diff_sum) > 10) {
score = score + 1
}
}
}
### 6) Compare the position of the amplitudes, but only if the p-value of the amplitude is significant. If amplitude remains the same then no point for score
if (signifikant) {
verlust_vec = parse_numeric_vector(row[["Verlust_Amplituden"]])
gewinn_vec = parse_numeric_vector(row[["Gewinn_Amplituden"]])
if (length(verlust_vec) > 0 && length(gewinn_vec) > 0) {
v_idx = which.max(verlust_vec)
g_idx = which.max(gewinn_vec)
if (!is.na(v_idx) && !is.na(g_idx) && v_idx != g_idx) {
if (is.na(richtung)) {
richtung = ifelse(g_idx > v_idx, "right", "left")
}
score = score + 1
}
}
}
if (is.na(richtung)) richtung = "unbekannt"
### define the scores
if (score >= 5) {
kategorie = paste0("strong_", richtung, "_shift")
} else if (score >= 4) {
kategorie = paste0("moderate_", richtung, "_shift")
} else {
kategorie = "no_shift"
}
return(list(Score = score, category = kategorie))
}
# Apply and save the function across all proteins and their categories
ergebnisse_liste = apply(df_ergebnisse, 1, klassifiziere_protein_shift_streng_v2)
#Add Shift_Score and Shift_Category to df_ergebnisse
df_ergebnisse$Shift_Score = sapply(ergebnisse_liste, function(x) x$Score)
df_ergebnisse$Shift_Category = sapply(ergebnisse_liste, function(x) x$category)
zeile = 'ACOT9_HUMAN'
png("Postervisuals/protein_distribution.png", width = 3200, height = 2000, res = 300)
# Combine y-values for both treatments
alle_werte = c(Ctrl1_norm[zeile,], RNAse1_norm[zeile,])
ylim_bereich = range(alle_werte, na.rm = TRUE)
# to get more place at the top (e.g. 15% more)
ylim_bereich[2] = ylim_bereich[2] * 1.15
# first plot for control
plot(1:25, Mittelwerte_Kontrolle[zeile,], type = "n",
xlab = "Fractions", ylab = "Intensity of Proteins", main = paste("Protein distribution:", zeile),
ylim = ylim_bereich, cex.lab=1.4, cex.main=1.6, cex.axis=1.2)
# stain area under the curve
polygon(c(1:25, 25:1),
c(Mittelwerte_Kontrolle[zeile,], rep(0, 25)),
col = adjustcolor(treatment_colors["Control"], alpha.f = 0.3), border = NA)
lines(1:25, Mittelwerte_Kontrolle[zeile,], col = treatment_colors["Control"], lwd = 2)
polygon(c(1:25, 25:1),
c(Mittelwerte_RNAse[zeile,], rep(0, 25)),
col = adjustcolor(treatment_colors["RNase"], alpha.f = 0.3), border = NA)
lines(1:25, Mittelwerte_RNAse[zeile,], col = treatment_colors["RNase"], lwd = 2)
legend("topright", legend = c("Control", "RNAse"),
col = treatment_colors, lwd = 2, cex = 0.9, inset = c(0, 0.02), yjust = 0)
# Find the index of the current protein in the tables
idx_ctrl = which(abs_Maxima_Mittelwerte_Kontrolle[, 1] == zeile)
idx_rnase = which(abs_Maxima_Mittelwerte_RNAse[, 1] == zeile)
# Get maxima for control from table
max_fraktion_ctrl = as.numeric(as.character(abs_Maxima_Mittelwerte_Kontrolle[idx_ctrl, 3]))
max_wert_ctrl = as.numeric(as.character(abs_Maxima_Mittelwerte_Kontrolle[idx_ctrl, 4]))
# Get maxima for RNAse from table
max_fraktion_rnase = as.numeric(as.character(abs_Maxima_Mittelwerte_RNAse[idx_rnase, 3]))
max_wert_rnase = as.numeric(as.character(abs_Maxima_Mittelwerte_RNAse[idx_rnase, 4]))
# Vertical arrows for the maxima
arrows(x0 = max_fraktion_ctrl, y0 = 0,
x1 = max_fraktion_ctrl, y1 = max_wert_ctrl,
col = treatment_colors["Control"], lwd = 1.5, length = 0.1)
arrows(x0 = max_fraktion_rnase, y0 = 0,
x1 = max_fraktion_rnase, y1 = max_wert_rnase,
col = treatment_colors["RNase"], lwd = 1.5, length = 0.1)
# Labels on the maxima
text(x = max_fraktion_ctrl, y = max_wert_ctrl + 0.10 * max_wert_ctrl,
labels = "Amplitude", col = treatment_colors["Control"], cex = 1.2)
text(x = max_fraktion_rnase, y = max_wert_rnase + 0.05 * max_wert_rnase,
labels = "Amplitude", col = treatment_colors["RNase"], cex = 1.2)
# Shift distance arrow between the maxima
arrow_y = min(max_wert_ctrl, max_wert_rnase)
arrows(x0 = max_fraktion_ctrl, y0 = arrow_y,
x1 = max_fraktion_rnase, y1 = arrow_y,
col = "black", lwd = 1.6, angle = 15, length = 0.1, code = 3) # code=3 = both ends
# Label for the shift
text(x = mean(c(max_fraktion_ctrl, max_fraktion_rnase)),
y = arrow_y + 0.05 * arrow_y,
labels = "Shift Distance", col = "black", cex = 1.2)
invisible(dev.off())
The general goal of the data modelling is on the one hand to evaluate how well the classification of RBPs oder non-RBPs based on the shift score is. On the other hand the main goal is to further analyze and characterize the identified RNA-dependent proteins by their biophysical characteristics which is done by performing linear and logistic regression analyses.
raw = readLines("Datensätze/Table_HS_RBP.txt")
raw_data = raw[7:length(raw)]
clean_data = gsub('\\"', '', raw_data)
writeLines(clean_data, "cleaned_table.txt")
Table_HS_RBP = read.delim("cleaned_table.txt",
sep = "\t",
header = TRUE,
stringsAsFactors = FALSE,
check.names = FALSE)
raw = readLines("Datensätze/Table_HS_Non_RBP.txt")
raw_data = raw[7:length(raw)]
clean_data = gsub('\\"', '', raw_data)
writeLines(clean_data, "cleaned_table.txt")
Table_HS_NO_RBP = read.delim("cleaned_table.txt",
sep = "\t",
header = TRUE,
stringsAsFactors = FALSE,
check.names = FALSE)
# Combining the tables "RBP_Table" and "No_RBP_Table"
rbp_cols = colnames(Table_HS_RBP)
#selecting columns excisting in both tables
existing_no_rbp_cols = intersect(rbp_cols, colnames(Table_HS_NO_RBP))
# NO_RBP_Table with the same columns
No_RBP_Table_aligned = Table_HS_NO_RBP[, existing_no_rbp_cols]
# Adding missing columns with NA
missing_cols = setdiff(rbp_cols, existing_no_rbp_cols)
for (col in missing_cols) {
No_RBP_Table_aligned[[col]] = NA
}
# Same order of the columns in both tables
No_RBP_Table_aligned = No_RBP_Table_aligned[, rbp_cols]
# Categorizing the proteins based on their original table
Table_HS_RBP$Kategorie = "RBP"
No_RBP_Table_aligned$Kategorie = "NO_RBP"
# Keeping the same entries a sin Table_MS_norm
RBP_Table = Table_HS_RBP[Table_HS_RBP$Entry_Name %in% rownames(MS_Table_norm), ]
No_RBP_Table = No_RBP_Table_aligned[No_RBP_Table_aligned$Entry_Name %in% rownames(MS_Table_norm), ]
# Combinig both tables for further analysis
combined_table = rbind(RBP_Table, No_RBP_Table)
# Loading data from Uniprot for more characteristics of the proteins
df_uniprot = read_tsv("Datensätze/Uniprot_infos2.tsv", comment = "#", quote = "")
df_uniprot_counts = df_uniprot %>%
mutate(
Zinc_Count = ifelse(is.na(`Zinc finger`) | `Zinc finger` == "", NA, lengths(strsplit(`Zinc finger`, ";\\s*"))),
Coil_Count = ifelse(is.na(`Coiled coil`) | `Coiled coil` == "", NA, lengths(strsplit(`Coiled coil`, ";\\s*"))),
Signal_Count = ifelse(is.na(`Signal peptide`) | `Signal peptide` == "", NA, lengths(strsplit(`Signal peptide`, ";\\s*"))),
Glyco_Count = ifelse(is.na(`Glycosylation`) | `Glycosylation` == "", NA, lengths(strsplit(`Glycosylation`, ";\\s*"))),
Disulfide_Count = ifelse(is.na(`Disulfide bond`) | `Disulfide bond` == "", NA, lengths(strsplit(`Disulfide bond`, ";\\s*"))),
TM_Count = ifelse(is.na(`Transmembrane`) | `Transmembrane` == "", NA, lengths(strsplit(`Transmembrane`, ";\\s*")))
) %>%
select(Entry, `Entry Name`, Zinc_Count, Coil_Count, TM_Count)
# Comining the new characterisitics
combined_table = combined_table %>%
left_join(df_uniprot_counts, by = c("Entry_Name" = "Entry Name"))
# adding the biophysical characteristics of the proteins in df_ergebnisse for the regression analysis
phys_traits = combined_table[, c("Entry_Name", "Mass_kDa", "Length_AA", "pI", "Listing_Count","Zinc_Count", "Coil_Count", "RBP2GO_Score", "TM_Count")]
df_ergebnisse_phys = merge(df_ergebnisse, phys_traits,
by.x = "Protein",
by.y = "Entry_Name", all = FALSE)
df_ergebnisse_phys$RBP_Status = ifelse(df_ergebnisse_phys$Listing_Count >= 5, "RBP", "NO_RBP")
# Classification of the proteins as RBP/ non-RBP based on their shift scores
df_ergebnisse_phys$RBP_nach_Score = ifelse(df_ergebnisse_phys$Shift_Category == "no_shift", "NO_RBP", "RBP")
df_ergebnisse_phys$RBP_nach_Score = factor(df_ergebnisse_phys$RBP_nach_Score, levels = c("NO_RBP", "RBP"))
To compare the classification of proteins based on the Shift Score (predicted RBP status) with the reference classification from a database (R-DeeP) (actual RBP status), a confusion matrix is used. Based on this, standard classification performance metrics, including accuracy, precision, recall, specificity, and the F1 score are calculated.
conf_matrix = table(df_ergebnisse_phys$RBP_Status, df_ergebnisse_phys$RBP_nach_Score)
colnames(conf_matrix) = c("predicted_NO_RBP", "predicted_RBP")
rownames(conf_matrix) = c("actual_NO_RBP", "actual_RBP")
conf_matrix
##
## predicted_NO_RBP predicted_RBP
## actual_NO_RBP 4719 659
## actual_RBP 1065 617
TN = 4719
FP = 659
FN = 1065
TP = 617
# fuction to calculate the classification metrics
calc_metrics = function(TP, TN, FP, FN) {
accuracy = (TP + TN) / (TP + TN + FP + FN)
PPV = TP / (TP + FP)
sensitivity = TP / (TP + FN)
specifity = TN / (TN + FP)
f1_score = 2 * (PPV * sensitivity) / (PPV + sensitivity)
# Ergebnisse als Dataframe
metrics_df = data.frame(
Metric = c("accuracy", "PPV", "sensitivity", "specifity", "f1_score"),
Value = c(accuracy, PPV, sensitivity, specifity, f1_score)
)
return(metrics_df)
}
testvalues = calc_metrics(TP, TN, FP, FN)
print(testvalues)
## Metric Value
## 1 accuracy 0.7558074
## 2 PPV 0.4835423
## 3 sensitivity 0.3668252
## 4 specifity 0.8774637
## 5 f1_score 0.4171738
# Assign RBP status
df_ergebnisse_phys = df_ergebnisse_phys %>%
mutate(RBP_Gruppe = ifelse(Shift_Category == "no_shift", "non-RBP", "RBP"))
# Calculate sums of proteins per RBP_Group + Shift_Category
df_summary = df_ergebnisse_phys %>%
group_by(RBP_Gruppe, Shift_Category) %>%
summarise(n = n(), .groups = "drop") %>%
mutate(
# Category label with number for the legend
Kategorie_mit_n = paste0(Shift_Category, " (", n, ")")
)
shift_colors_named = setNames(
shift_colors[df_summary$Shift_Category],
df_summary$Kategorie_mit_n
)
proteins_per_shift = ggplot(df_summary, aes(x = RBP_Gruppe, y = n, fill = Kategorie_mit_n)) +
geom_col(position = "stack") +
scale_fill_manual(values = shift_colors_named) +
labs(title = "Total Amount of Proteins per Shift category",
x = "Protein Type", y = "Total Amount of Proteins", fill = "Shift-Category (n)") +
theme_poster
ggsave("Postervisuals/proteins_per_shift.png", plot = proteins_per_shift, width = 20, height = 12.5, units = "cm", dpi = 300)
print(proteins_per_shift)
Several linear regression analyses were performed to find correlations between different biophysical characteristics of the proteins and the Shift score. Many of those analyses are not shown anymore because they all did not identify significant predictors for the shift score. One linear regression analysis is performed with the isoelectric point and mass of the proteins as predictors for their shift score. The goal of this regression analysis is to find out whether those two biophysical traits have a significant impact on the Shift_Score and therefore on the categorization of the proteins as RBP or NO_RBP.
model_linear = lm(Shift_Score ~ pI + Mass_kDa, data = df_ergebnisse_phys)
summary(model_linear)
##
## Call:
## lm(formula = Shift_Score ~ pI + Mass_kDa, data = df_ergebnisse_phys)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.878 -1.565 -1.062 1.048 8.162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0325538 0.1110707 -0.293 0.769461
## pI 0.2278878 0.0144892 15.728 < 2e-16 ***
## Mass_kDa 0.0011384 0.0002978 3.823 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.123 on 7033 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.03417, Adjusted R-squared: 0.0339
## F-statistic: 124.4 on 2 and 7033 DF, p-value: < 2.2e-16
Both, the isoelectric point as well as the mass have a significant correlation with the Shift_Score because the p-value for both coefficients is significant. The correlation between those two characteristics and the shift score is visualized below.
df_plot = df_ergebnisse_phys[complete.cases(df_ergebnisse_phys[, c("RBP_nach_Score", "Mass_kDa", "pI")]), ]
df_plot = df_plot[df_plot$Mass_kDa < 2000, ]
df_plot$predicted_score = predict(model_linear, newdata = df_plot, type = "response")
shift_score_mass_pI = ggplot(df_plot, aes(x = Mass_kDa, y = pI, color = predicted_score)) +
geom_point(size = 2, alpha = 0.7) +
scale_color_gradientn(colours = gradient_colour1) +
labs(title = "Shift_Score based on mass and pI",
x = "Mass (kDa)", y = "Isoelectric point (pI)",
color = "Score") +
theme_poster
ggsave("Postervisuals/shift_score_mass_pI.png", plot = shift_score_mass_pI, width = 22, height = 13.5, units = "cm", dpi = 300)
print(shift_score_mass_pI)
# density distribution of pI and mass indicating a difference between RBP and NO_RBP
ggplot(df_ergebnisse_phys, aes(x = Mass_kDa, y = pI)) +
stat_density_2d(aes(fill = ..level..), geom = "polygon") +
facet_wrap(~RBP_nach_Score) +
labs(x = "Mass (kDa)", y = "pI", title = "Density of mass and pI for RBP classification") +
theme_minimal()
The R-squared value of the regression analysis is 0.03363 indicating
that the isoelectric point and mass alone are not sufficient enough to
predict the Shift_Score of the proteins. This result is visualized
below. Both traits have a linear correlation with the shift score but
there is a big deviation noticeable showing that these traits alone do
not eexplain the Shift_Score.
ggplot(df_plot, aes(x = pI, y = Shift_Score, color = Shift_Category)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_color_manual(values = shift_colors, name = "Shift Category") +
labs(title = "Influence of pI on shift score",
x = "Isoelectric point (pI)",
y = "Shift score") +
theme_poster
ggplot(df_plot, aes(x = Mass_kDa, y = Shift_Score, color = Shift_Category)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_color_manual(values = shift_colors, name = "Shift Category") +
labs(title = "Influence of mass on shift score",
x = "Mass (kDa)",
y = "Shift score") +
theme_poster
Performing a data reduction and cluster analysis based on the coefficients of the linear regression analysis, the goal is to find out the number of clusters and comparing those with the assignment of the proteins in different shift categories.
# Choosing the variables for the clusteriing
vars = c("Mass_kDa", "pI", "Shift_Score")
df_clust = df_ergebnisse_phys[complete.cases(df_ergebnisse_phys[, vars]), ]
# Scaling to prepare for clustering
scaled_data = scale(df_clust[, vars])
# k-means clustering
set.seed(42) # für Reproduzierbarkeit
kmeans_result = kmeans(scaled_data, centers = 4)
# Saving cluster assignment
df_clust$regression_cluster = factor(kmeans_result$cluster)
# PCA to perform a data reduction based on the proteins with selected variables for the clustering
pca = prcomp(scaled_data, scale. = TRUE)
# Data for plotting
plot_df = data.frame(PC1 = pca$x[,1],
PC2 = pca$x[,2],
Cluster = df_clust$regression_cluster)
# Plot
ggplot(plot_df, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2, alpha = 0.7) +
theme_poster +
labs(title = "Clusteriing based on regression analysis",
x = paste0("PC1 (", round(summary(pca)$importance[2,1]*100,1), "%)"),
y = paste0("PC2 (", round(summary(pca)$importance[2,2]*100,1), "%)")) +
scale_color_brewer(palette = "Set2")
This clustering contains an outlier that skews the clustering plot so it has to be removed. The proteins are clustered again and the silhouette plot shows which the best number of clusters in this analysis is.
# Detecting the outlier and removing it
outlier_index = which.max(abs(plot_df$PC2))
df_clust_no_outlier = df_clust[-outlier_index, ]
scaled_data_no_outlier = scale(df_clust_no_outlier[, vars])
set.seed(42)
kmeans_result = kmeans(scaled_data_no_outlier, centers = 4)
# New cluster assignment
df_clust_no_outlier$regression_cluster = factor(kmeans_result$cluster)
pca_no_outlier = prcomp(scaled_data_no_outlier, scale. = FALSE)
plot_df_clean = data.frame(
PC1 = pca_no_outlier$x[,1],
PC2 = pca_no_outlier$x[,2],
Cluster = df_clust_no_outlier$regression_cluster,
Protein = df_clust_no_outlier$Protein
)
# Merging of shift category by protein name in the dataframe for plotting
plot_df_clean = merge(plot_df_clean,
df_ergebnisse_phys[, c("Protein", "Shift_Category")],
by = "Protein", all.x = TRUE)
ggplot(plot_df_clean, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2, alpha = 0.8) +
theme_poster +
labs(title = "Clustering by pI, mass and shift score ") +
scale_color_brewer(palette = "Set2")
# Silhouette plot for the clustering
# distance matrix
dists = dist(scaled_data_no_outlier)
# Silhouette with k-means assignments
colors_set2 = brewer.pal(4, "Set2")
sil = silhouette(as.numeric(df_clust_no_outlier$regression_cluster), dists)
plot(sil, main = "Silhouette-Plot for K-Means-Clustering (k = 4)", col = colors_set2, border = NA)
Combining the clustering information with the assignment of each protein to a shift category, the goal is to find out if there is a visible structure in the clustering based on the information of the shift category. The result of the clustering shows that pI and mass contribute to the shift score without distorting the underlying classification meaning that it was successful to visualize that pI and mass explain some of variance of the shift score. Therefore, the result is that especially left shift proteins have a different isoelectric point than proteins with no shift as they are the most abundant in cluster 4 and for that reason thy are the most separated from the non-shifting proteins.
# Most abundant shift categroy in each cluster
# absolute values
summary_per_cluster = df_clust_no_outlier %>%
group_by(regression_cluster, Shift_Category) %>%
summarise(n = n(), .groups = "drop") %>%
arrange(desc(n))
#relative values
summary_per_cluster_rel = summary_per_cluster %>%
group_by(regression_cluster) %>%
mutate(
Anteil = n / sum(n)
) %>%
ungroup()
# Proportion of RBP and NO_RBP in each cluster
summary_per_cluster_RBP = df_clust_no_outlier %>%
group_by(regression_cluster, RBP_nach_Score) %>%
summarise(n = n(), .groups = "drop") %>%
arrange(desc(n))
summary_per_cluster_rel_RBP = summary_per_cluster_RBP %>%
group_by(regression_cluster, RBP_nach_Score) %>%
mutate(
Anteil = n / sum(n)
) %>%
ungroup()
# Statistical test to see if the clusters distinguish in shift category
# Contingency table
table_shift_cluster = table(df_clust_no_outlier$regression_cluster, df_clust_no_outlier$Shift_Category)
table_shift_cluster_RBP = table(df_clust_no_outlier$regression_cluster, df_clust_no_outlier$RBP_nach_Score)
fisher.test(table_shift_cluster_RBP, workspace = 2e8)
##
## Fisher's Exact Test for Count Data
##
## data: table_shift_cluster_RBP
## p-value < 2.2e-16
## alternative hypothesis: two.sided
# Chi²-test for shift categories
chisq.test(table_shift_cluster)
##
## Pearson's Chi-squared test
##
## data: table_shift_cluster
## X-squared = 5874, df = 12, p-value < 2.2e-16
# Verteilung der shift Kategorien unterscheidet sich signifikant zwischen den Clustern
# Chi²-test for RBP classification in clusters
# ebenfalls signifikant
chisq.test(table_shift_cluster_RBP)
##
## Pearson's Chi-squared test
##
## data: table_shift_cluster_RBP
## X-squared = 5562, df = 3, p-value < 2.2e-16
Both, the shift categories and RBP classification distinguish in between the clusters. These results are visualized below by coloring the proteins by shift category.
cluster_shift_score_mass_pI = ggplot(plot_df_clean, aes(x = PC1, y = PC2, color = Shift_Category)) +
geom_point(alpha = 0.7, size = 2) +
scale_color_manual(values = shift_colors, name = "Shift Category") +
labs(title = "PCA: Clustering by pI, mass and shift score") +
theme_poster
ggsave("Postervisuals/cluster_shift_score_mass_pI.png", plot = cluster_shift_score_mass_pI, width = 15, height = 9.3, units = "cm", dpi = 300)
print(cluster_shift_score_mass_pI)
The distribution of each shift category in each cluster is visualzed below.
boxplots_cluster_pI = ggplot(df_clust_no_outlier, aes(x = Shift_Category, y = pI, fill = Shift_Category)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = shift_colors, name = "Shift Category") +
facet_wrap(~ regression_cluster) +
labs(title = "pI-values pro shift-category in each cluster") +
theme_poster %+replace% theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
ggsave("Postervisuals/boxplots_cluster_pI.png", plot = boxplots_cluster_pI, width = 15.5, height = 10, units = "cm", dpi = 300)
print(boxplots_cluster_pI)
To show the difference in the isoelectric point in each shift category, a violin plot is rendered.
violinplot_pI = ggplot(df_clust, aes(x = Shift_Category, y = pI,fill = Shift_Category)) +
geom_violin(alpha = 0.8) +
geom_boxplot(width = 0.2, outlier.shape = NA, fill = "white") +
scale_fill_manual(values = shift_colors) +
labs(title = "Isoelectric point in each shift category",
x = "Shift category",
y = "pI") +
theme_poster %+replace% theme(legend.position = "none")
ggsave("Postervisuals/violinplot_pI.png", plot = violinplot_pI, width = 8, height = 6, dpi = 600)
violinplot_pI
Below there is a density distribution of the isoelectric point for the two categories “RBP” and “NO_RBP”. It is another visualization to show the difference in the isoelectric point for the different categories.
ggplot(df_ergebnisse_phys, aes(x = pI, fill = RBP_nach_Score)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = color_rbp) +
theme_poster +
labs(title = "pI-distribution for RBP and non-RBP", x = "pI", y = "density")
# Identifying the actual value of the isoelectric point for RBP and NO_RBP
# Visible difference between the mean isoelectric points of the classifications
df_ergebnisse_phys %>%
group_by(RBP_Status) %>%
summarise(
mean_pI = mean(pI, na.rm = TRUE),
median_pI = median(pI, na.rm = TRUE),
n = n()
) %>%
arrange(desc(mean_pI))
## # A tibble: 2 × 4
## RBP_Status mean_pI median_pI n
## <chr> <dbl> <dbl> <int>
## 1 RBP 7.47 7.11 1682
## 2 NO_RBP 6.91 6.58 5378
The violin plot for the log10(mass) in each shift category shows that there is not a significant difference visible. Without logarithmizing the mass the violin plots were not visible.
violinplot_mass = ggplot(df_clust, aes(x = Shift_Category, y = log10(Mass_kDa), fill = Shift_Category)) +
geom_violin(alpha = 0.4) +
geom_boxplot(width = 0.2, outlier.shape = NA, fill = "white") +
scale_fill_manual(values = shift_colors) +
labs(
title = "Log10(Mass) in each shift category",
x = "Shift-category",
y = expression(log[10]*"(Mass in kDa)")
) +
theme_poster %+replace% theme(legend.position = "none")
violinplot_mass
Finding 570 proteins that have an information about a zinc finger motive and are also in the dataset that this analysis is based on, led to the goal of searching for a correlation between a zinc finger motive and the RBP classification. A linear regression analysis with the zinc finger motive was not successful so that a logistic regression is performed with the target variable of RBP classification in RBP or NO_RBP instead of the shift score in the linear regression analysis. Because the isoelectric point was already identified as influencial on both the shift score and the RBP classification, the pI is also included in the logistic regression analysis.
df_ergebnisse_phys$RBP_nach_Score = factor(df_ergebnisse_phys$RBP_nach_Score, levels = c("NO_RBP", "RBP"))
model_logit = glm(RBP_nach_Score ~ Zinc_Count + pI,
data = df_ergebnisse_phys,
family = "binomial")
summary(model_logit)
##
## Call:
## glm(formula = RBP_nach_Score ~ Zinc_Count + pI, family = "binomial",
## data = df_ergebnisse_phys)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.749002 0.535560 -5.133 2.85e-07 ***
## Zinc_Count -0.022143 0.009506 -2.329 0.01984 *
## pI 0.212084 0.069833 3.037 0.00239 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 570.01 on 568 degrees of freedom
## Residual deviance: 556.06 on 566 degrees of freedom
## (6491 observations deleted due to missingness)
## AIC: 562.06
##
## Number of Fisher Scoring iterations: 4
df_pred = df_ergebnisse_phys[complete.cases(df_ergebnisse_phys[, c("Zinc_Count", "pI")]), ]
df_pred$fitted_Status = predict(model_logit, newdata = df_pred, type = "response")
# predicted probability for RBP based on pI and zinc finger motive
df_plot = df_ergebnisse_phys[complete.cases(df_ergebnisse_phys[, c("RBP_nach_Score", "Zinc_Count", "pI")]), ]
df_plot$predicted_prob = predict(model_logit, type = "response")
ggplot(df_plot, aes(x = Zinc_Count, y = pI, color = predicted_prob)) +
geom_point(size = 2, alpha = 0.7) +
scale_color_gradientn(colors = gradient_colour2) +
labs(title = "Predicted probability for RBP",
x = "Amount of zinc finger", y = "Isoelectric point (pI)",
color = "P(RBP)") +
theme_poster
# Visualizing pI and zinc finger motives seperately
# Densitiy distribution of pI and zinc finger motives for each RBP score
# Zinc_Count
ggplot(df_ergebnisse_phys, aes(x = Zinc_Count, fill = RBP_nach_Score)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = color_rbp) +
labs(title = "Densitiy distribution of zinc_count for each RBP score",
x = "Zinc_Count", fill = "RBP classification") +
theme_poster
# pI
ggplot(df_ergebnisse_phys, aes(x = pI, fill = RBP_nach_Score)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = color_rbp) +
labs(title = "Densitiy distribution of pI for each RBP score",
x = "Isoelectric point (pI)", fill = "RBP classification") +
theme_poster
# Extracting the amount of RBPs with zinc motive
amount_RBP = sum(df_ergebnisse_phys$RBP_nach_Score == "RBP")
amount_RBP_zinc = sum(df_ergebnisse_phys$Zinc_Count > 0 & df_ergebnisse_phys$RBP_nach_Score == "RBP", na.rm = TRUE)
# Extracting the amount of NO_RBPs with zinc motive
amount_NO_RBP = sum(df_ergebnisse_phys$RBP_nach_Score == "NO_RBP")
amount_NO_RBP_zinc = sum(df_ergebnisse_phys$Zinc_Count > 0 & df_ergebnisse_phys$RBP_nach_Score == "NO_RBP", na.rm = TRUE)
# Percentage of both
prop_RBP_zinc = amount_RBP_zinc / amount_RBP * 100
prop_NO_RBP_zinc = amount_NO_RBP_zinc / amount_NO_RBP * 100
table(prop_RBP_zinc,prop_NO_RBP_zinc)
## prop_NO_RBP_zinc
## prop_RBP_zinc 7.88381742738589
## 8.93416927899687 1
# Visualization of the percentages of zinc finger motives in RBPs and NO_RBPs
df_prop = data.frame(
group = c("RBP", "NO_RBP"),
percentage_zinc = c(prop_RBP_zinc, prop_NO_RBP_zinc)
)
proteins_zinc_motiv = ggplot(df_prop, aes(x = group, y = percentage_zinc, fill = group)) +
geom_bar(stat = "identity", width = 0.6, alpha = 0.8) +
scale_fill_manual(values = color_rbp) +
scale_y_continuous(labels = function(x) paste0(x, "%"), limits = c(0, max(df_prop$percentage_zinc) + 5)) +
labs(title = "Amount of proteins with zinc finger motive",
y = "Proteins shift_colors_named (%)",
x = "Classification") +
theme_poster %+replace% theme(legend.position = "none")
# Save plot as PNG
ggsave(filename = "Postervisuals/proteins_zinc_motiv.png", plot = proteins_zinc_motiv, width = 11.7, height = 7.5, units = "cm", dpi = 600)
print(proteins_zinc_motiv)
# Statistical test (wilcoxon test) to see if there is a significant difference of zinc finger amount in RBP and NO_RBP
# shows that there is a significant difference
wilcox.test(Zinc_Count ~ RBP_nach_Score, data = df_ergebnisse_phys)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Zinc_Count by RBP_nach_Score
## W = 30634, p-value = 0.002216
## alternative hypothesis: true location shift is not equal to 0
Through protein classification into shift categories, we could distinguish RBPs from non-RBPs. Compared to previous studies1,3, 53% of RBPs were classified as non-RBPs, and 18% of non-RBPs as RBPs. The test statistics show high specificity (88%) but low sensitivity (36%), suggesting limited detection of all proteins by the function. This is a limited assumption, as studies only separate RBPs and non-RBPs. Many proteins may be RNA-dependent without directly binding RNA.
The results of the linear regression indicate that the isoelectric point differs in RBPs compared to non-RBPs. The pI is higher in RBPs than in non-RBPs. That coincides with the expectation that RBPs are charged more positively in order to interact with negatively charged RNA. The higher pI in RBPs was also found in literature proving the accuracy of the analysis.
The results of the cluster analysis have to be regarded carefully because the clustering is partially based on the shift score. The shift categories result out of the shift score so the only conclusion from the clustering is that the biophysical characteristics do not distort the classification – not that they explain the classification perfectly. Nevertheless those results contribute to the outcomne of the linear regression that there is a difference between RNA-dependent proteins (in the code categorized as RBP) and RNA-independent proteins (in the code categorized as NO_RBP) in their isoelectric point. Although the mass has a significant p-value in the regression analysis, no actual significant difference between RBPs and no RBPs in their mass is detected.
The conclusion from the logistic regression is that zinc finger motives probably occur more often in RBPs than in non-RBPs. That result is also found in literture and fits the expectation that te motive is one way to bind to RNA.
To further analyze the characteristics of RNA-dependent proteins, more data would have been needed as the amount of information for each characterisitc often was not sufficient in the data that was used.
Overall, the results of the whole analysis of mass spectrometry data can be regarded as successful because categorizing the proteins in RNA-dependent and RNA-independent by their shifting behaviour is mostly compliant to databases and the expectations of specific RNA-dependent characteristics are met. The code of RNA-dependency therefore is cracked to some extend.